Use case

While crawling the Baidu top charts with webmagic, I ran into a timeout error.

Error output

09:15:23.811 [pool-1-thread-1] DEBUG org.apache.http.impl.conn.PoolingHttpClientConnectionManager - Connection released: [id: 0][route: {}->http://top.baidu.com:80][total kept alive: 0; route allocated: 0 of 100; total allocated: 0 of 5]
09:15:23.814 [pool-1-thread-1] WARN us.codecraft.webmagic.downloader.HttpClientDownloader - download page http://top.baidu.com/category?c=513&fr=topbuzz error
java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:171)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
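
For context, a `SocketTimeoutException` like the one in the trace above is thrown when a blocking socket read receives no data within the configured read timeout. The following is a minimal sketch of that behaviour using only `java.net`. It assumes a local server that never sends anything; the class and method names (`ReadTimeoutDemo`, `readTimesOut`) are illustrative and are not part of webmagic.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class ReadTimeoutDemo {

    /**
     * Connects to a local server that never sends data and attempts one read
     * with the given timeout. Returns true if the read timed out.
     */
    static boolean readTimesOut(int timeoutMillis) throws IOException {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        // Port 0 lets the OS pick a free port. The server never accepts or writes;
        // the OS still completes the TCP handshake from the listen backlog.
        try (ServerSocket silentServer = new ServerSocket(0, 1, loopback);
             Socket client = new Socket(loopback, silentServer.getLocalPort())) {
            client.setSoTimeout(timeoutMillis); // read timeout in milliseconds
            InputStream in = client.getInputStream();
            try {
                in.read(); // blocks until data arrives or the timeout expires
                return false;
            } catch (SocketTimeoutException e) {
                return true;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("timed out: " + readTimesOut(200));
    }
}
```

A slow remote server has the same effect as the silent one here: if no bytes arrive within the timeout window, the read fails even though the connection itself was established.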

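Problem analysis

The crawler framework reads the page content through SocketInputStream, and the download was so slow that the read timed out. I tried raising the timeout as follows: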

private Site site = Site.me().setRetryTimes(3).setSleepTime(100)
        .setUserAgent("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:53.0) Gecko/20100101 Firefox/53.0")
        .setTimeOut(100000); // timeout in milliseconds

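It made no difference.

Actual cause:

The network connection was simply too poor. Running the crawler again once the connection had improved worked fine.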
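
Because the failure was transient, one possible mitigation is to retry the download after a pause instead of rerunning the whole crawl by hand. The sketch below is a generic helper built on the standard library only. It is not webmagic's own retry mechanism, and the names `RetryOnTimeout` and `callWithRetry` are hypothetical. It retries a task when a `SocketTimeoutException` is thrown, doubling the wait between attempts.

```java
import java.net.SocketTimeoutException;
import java.util.concurrent.Callable;

public class RetryOnTimeout {

    /**
     * Runs the task, retrying on SocketTimeoutException up to maxAttempts times.
     * The wait between attempts starts at initialWaitMillis and doubles each time.
     */
    static <T> T callWithRetry(Callable<T> task, int maxAttempts, long initialWaitMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        long wait = initialWaitMillis;
        for (int attempt = 1; ; attempt++) {
            try {
                return task.call();
            } catch (SocketTimeoutException e) {
                if (attempt >= maxAttempts) {
                    throw e; // out of attempts, surface the last timeout
                }
                Thread.sleep(wait); // give the network time to recover
                wait *= 2;
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated download: times out twice, then succeeds.
        String result = callWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new SocketTimeoutException("Read timed out");
            }
            return "page body";
        }, 5, 10L);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

In a real crawl the task would wrap the actual page download. Retrying only helps when the slowdown is temporary, as it was in this case.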